Following the tutorial at:



In [24]:

    
import pandas as pd



In [25]:

    
# pandas has two primary data structures: Series (a single column) and DataFrame (a table of named columns)
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])



In [26]:

    
pd.DataFrame({"City Name": city_names, "Population": population})









    Out[26]:

       City Name  Population
0  San Francisco      852469
1       San Jose     1015785
2     Sacramento      485199



In [27]:

    
# Load an existing CSV file into a DataFrame
california_housing_dataframe = pd.read_csv(
    "https://storage.googleapis.com/mledu-datasets/california_housing_train.csv",
    sep=","
)



In [28]:

    
california_housing_dataframe.shape









    Out[28]:





(17000, 9)



In [29]:

    
california_housing_dataframe.head()









    Out[29]:

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
0    -114.31     34.19                15.0       5612.0          1283.0      1015.0       472.0         1.4936             66900.0
1    -114.47     34.40                19.0       7650.0          1901.0      1129.0       463.0         1.8200             80100.0
2    -114.56     33.69                17.0        720.0           174.0       333.0       117.0         1.6509             85700.0
3    -114.57     33.64                14.0       1501.0           337.0       515.0       226.0         3.1917             73400.0
4    -114.57     33.57                20.0       1454.0           326.0       624.0       262.0         1.9250             65500.0



In [30]:

    
california_housing_dataframe.hist('housing_median_age')









    Out[30]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f1ce0fe0e80>]],
      dtype=object)

Accessing Data

You can access DataFrame data using familiar Python dict/list operations:



In [31]:

    
cities = pd.DataFrame({'City Name': city_names, 'Population': population})
print(type(cities['City Name']))
cities['City Name']









    



<class 'pandas.core.series.Series'>






    Out[31]:





0    San Francisco
1         San Jose
2       Sacramento
Name: City Name, dtype: object



In [32]:

    
print(type(cities["City Name"][1]))
cities["City Name"][1]









    



<class 'str'>






    Out[32]:





'San Jose'



In [33]:

    
print(type(cities[0:2]))
cities[0:2]









    



<class 'pandas.core.frame.DataFrame'>






    Out[33]:

       City Name  Population
0  San Francisco      852469
1       San Jose     1015785
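
The bracket forms above mix two kinds of lookup: cities["City Name"][1] finds the row whose index label is 1, while cities[0:2] slices rows by position. The following is a minimal sketch, assuming the same three-city data, that makes each kind explicit with .loc (label-based) and .iloc (position-based). The variable names by_label, by_position, and first_two are illustrative only.

```python
import pandas as pd

cities = pd.DataFrame({
    'City Name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# Label-based lookup: the row whose index label is 1, column 'City Name'.
by_label = cities.loc[1, 'City Name']

# Position-based lookup: second row, first column.
by_position = cities.iloc[1, 0]

# Position-based row slice, equivalent to cities[0:2].
first_two = cities.iloc[0:2]

print(by_label, by_position)
print(first_two)
```

With the default RangeIndex, labels and positions coincide, so both lookups return the same value. They can differ once the rows have been reordered.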

Manipulating Data

You may apply Python's basic arithmetic operations to Series. For example:



In [36]:

    
population / 1000









    Out[36]:





0     852.469
1    1015.785
2     485.199
dtype: float64



In [37]:

    
import numpy as np
np.log(population)









    Out[37]:





0    13.655892
1    13.831172
2    13.092314
dtype: float64



In [40]:

    
cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])
cities['Population density'] = cities['Population'] / cities['Area square miles']
cities









    Out[40]:

       City Name  Population  Area square miles  Population density
0  San Francisco      852469              46.87        18187.945381
1       San Jose     1015785             176.53         5754.177760
2     Sacramento      485199              97.92         4955.055147



In [39]:

    
population.apply(lambda val: val > 1000000)









    Out[39]:





0    False
1     True
2    False
dtype: bool
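
Comparison operators are also applied element-wise to a Series, so the lambda above is not strictly needed. A small sketch, assuming the same population values, that computes the result both ways (the names over_one_million and via_apply are illustrative):

```python
import pandas as pd

population = pd.Series([852469, 1015785, 485199])

# Element-wise comparison; no Python-level lambda is required.
over_one_million = population > 1000000

# The same result computed with apply, as in the cell above.
via_apply = population.apply(lambda val: val > 1000000)

print(over_one_million.tolist())  # prints [False, True, False]
print(via_apply.tolist())         # prints [False, True, False]
```

Series.apply remains useful when the transformation cannot be written with operators or NumPy functions.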

Exercise #1

Modify the cities table by adding a new boolean column that is True if and only if both of the following are True:

The city is named after a saint.
The city has an area greater than 50 square miles.

Note: Boolean Series are combined using the bitwise, rather than the traditional boolean, operators. For example, when performing logical and, use & instead of and.

Hint: "San" in Spanish means "saint."
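
To illustrate the note on bitwise operators without giving away the exercise, here is a hedged sketch on a different question (which cities are both large and compact). The thresholds and the variable names are made up for illustration.

```python
import pandas as pd

population = pd.Series([852469, 1015785, 485199])
area = pd.Series([46.87, 176.53, 97.92])

# Parentheses are required: & binds more tightly than > and <.
large_and_compact = (population > 500000) & (area < 100)

# | is the element-wise "or"; ~ is the element-wise "not".
large_or_compact = (population > 500000) | (area < 100)
not_large = ~(population > 500000)

# The keyword form raises ValueError, because a Series has no single truth value.
try:
    (population > 500000) and (area < 100)
    keyword_and_failed = False
except ValueError:
    keyword_and_failed = True

print(large_and_compact.tolist())  # prints [True, False, False]
print(keyword_and_failed)          # prints True
```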



In [46]:

    
is_wide = cities['Area square miles'] > 50
is_saint = cities['City Name'].str.startswith('San')
cities['is saint and wide'] = is_wide & is_saint
cities









    Out[46]:

       City Name  Population  Area square miles  Population density  is saint and wide
0  San Francisco      852469              46.87        18187.945381              False
1       San Jose     1015785             176.53         5754.177760               True
2     Sacramento      485199              97.92         4955.055147              False

Indexes

Both Series and DataFrame objects also define an index property that assigns an identifier value to each Series item or DataFrame row.

By default, at construction, pandas assigns index values that reflect the ordering of the source data. Once created, the index values are stable; that is, they do not change when data is reordered.



In [47]:

    
city_names.index









    Out[47]:





RangeIndex(start=0, stop=3, step=1)



In [48]:

    
cities.index









    Out[48]:





RangeIndex(start=0, stop=3, step=1)
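
The claim that index values are stable can be checked by reordering the rows and inspecting the index afterwards. A minimal sketch, assuming the same three-city data and using sort_values as the reordering step (by_population is an illustrative name):

```python
import pandas as pd

cities = pd.DataFrame({
    'City Name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# Sorting reorders the rows, but each row keeps its original index label.
by_population = cities.sort_values('Population')

print(by_population.index.tolist())       # prints [2, 0, 1]
print(by_population.loc[1, 'City Name'])  # prints San Jose
```

Label 1 still identifies San Jose even though that row is now last.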



In [50]:

    
cities.reindex([2, 0, 1])









    Out[50]:

       City Name  Population  Area square miles  Population density  is saint and wide
2     Sacramento      485199              97.92         4955.055147              False
0  San Francisco      852469              46.87        18187.945381              False
1       San Jose     1015785             176.53         5754.177760               True

Reindexing is a convenient way to shuffle (randomize) a DataFrame. In the example below, we take the index, which is array-like, and pass it to NumPy's random.permutation function, which returns a shuffled copy of its values and leaves the original index unchanged. Calling reindex with this shuffled array reorders the DataFrame rows in the same way.



In [52]:

    
cities.reindex(np.random.permutation(cities.index))









    Out[52]:

       City Name  Population  Area square miles  Population density  is saint and wide
1       San Jose     1015785             176.53         5754.177760               True
2     Sacramento      485199              97.92         4955.055147              False
0  San Francisco      852469              46.87        18187.945381              False
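
As a quick check of what the shuffle does and does not change, the sketch below (assuming the same three-city data) confirms that np.random.permutation hands back a new array, that the original DataFrame keeps its order, and that every row still carries its own label. DataFrame.sample(frac=1) is shown as an alternative one-liner; the random_state value is an arbitrary choice for reproducibility.

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame({
    'City Name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# permutation returns a new shuffled array; cities.index itself is untouched.
shuffled_labels = np.random.permutation(cities.index)
shuffled = cities.reindex(shuffled_labels)

print(cities.index.tolist())            # prints [0, 1, 2]
print(sorted(shuffled.index.tolist()))  # prints [0, 1, 2]

# Each label still points at the same city after shuffling.
print(shuffled.loc[1, 'City Name'])     # prints San Jose

# Alternative: sample every row without replacement.
shuffled_alt = cities.sample(frac=1, random_state=0)
```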

Exercise #2

The reindex method allows index values that are not in the original DataFrame's index values. Try it and see what happens if you use such values! Why do you think this is allowed?



In [53]:

    
cities.reindex([4, 2, 1, 3, 0])









    Out[53]:

       City Name  Population  Area square miles  Population density  is saint and wide
4            NaN         NaN                NaN                 NaN                NaN
2     Sacramento    485199.0              97.92         4955.055147              False
1       San Jose   1015785.0             176.53         5754.177760               True
3            NaN         NaN                NaN                 NaN                NaN
0  San Francisco    852469.0              46.87        18187.945381              False
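
The output above shows two effects: labels with no matching row produce rows of NaN, and the Population column switches from integers to floats because NaN is a floating-point value. One plausible reason this is allowed is that index labels often come from outside the data (for example, a list of requested keys), and tolerating missing ones avoids having to sanitize that list first. A hedged sketch, assuming the same three-city data, of two ways to handle the missing rows (the names expanded, present_only, and filled are illustrative):

```python
import pandas as pd

cities = pd.DataFrame({
    'City Name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# Labels 3 and 4 do not exist, so their rows are filled with NaN.
expanded = cities.reindex([4, 2, 1, 3, 0])
print(expanded['Population'].dtype)   # prints float64

# Option 1: drop the rows that have no data.
present_only = expanded.dropna()
print(present_only.index.tolist())    # prints [2, 1, 0]

# Option 2: supply a placeholder up front with fill_value.
filled = cities['Population'].reindex([4, 2, 1, 3, 0], fill_value=0)
print(filled.tolist())                # prints [0, 485199, 1015785, 0, 852469]
```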


